Statistical genomics

 

The development and application of statistical methodologies to help analyze and interpret data from 'omics' technologies

The human genome in a nutshell

Genome is more than a sequence

  • Human genome contains ~4M single nucleotide polymorphisms (SNPs)

  • No new SNPs - discovery of new SNPs appears to saturate at ~8,500 deeply sequenced genomes (Telenti A. et al. "Deep sequencing of 10,000 human genomes" PNAS 2016 http://www.pnas.org/content/113/42/11901)

Genome is more than genes

  • SNPs – single nucleotide polymorphisms – and other genomic variants (CNVs, InDels, SVs) are located everywhere

  • Only 12% of SNPs are located in, or occur in tight linkage disequilibrium with, protein-coding regions.

Hindorff LA. et al. "Potential etiologic and functional implications of genome-wide association loci for human diseases and traits" PNAS 2009, http://www.pnas.org/content/106/23/9362.long

Genome contains millions of regulatory regions

  • DNaseI hypersensitive sites
  • Histone modification marks
  • Transcription Factor Binding Sites
  • DNA methylation
  • Enhancers

Collectively referred to hereafter as epigenomic or regulatory regions

Definition of Epigenomics

Epigenomic data = genome annotation data = regions other than DNA sequence, annotated as carrying functional/regulatory potential or having a biological property

Conrad Hal Waddington - the father of epigenetics

Genome annotation consortia

Genome in 3D

  • Genome is not linear - the spatial organization plays an important role in regulation of gene expression
  • Chromosome Conformation Capture sequencing technologies allow exploring long-range interactions and the formation of loops and domains
http://www.pnas.org/content/112/47/E6456.long

Long-range interactions

  • Long-range interactions compartmentalize a variety of regulatory elements to regulate gene expression
  • Disruption of one or more such interactions => gene expression changes => disease

Genomic data and biostatistics methods

Growth of genomic data

Gap between data generation & data understanding

There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data. New users are left to navigate a bewildering maze of base calling, alignment, assembly and analysis tools with often incomplete documentation and no idea how to compare and validate their outputs. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 analysis price tag.

McPherson, John D. “Next-Generation Gap.” Nature Methods 2009 http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.f.268.html

GenomeRunner – a global positioning system within the genome

Epigenomic enrichment analysis

What is enrichment analysis?

Enrichment analysis – detecting whether a group of objects has certain properties more (or less) frequently than expected by chance

  • Gene set enrichment analysis - summarizing many genes of interest, such as differentially expressed genes, with a few common gene annotations (molecular functions, canonical pathways)
  • Epigenomic enrichment analysis - summarizing many genomic regions of interest, such as disease-associated genomic variants, with a few common genome annotations (chromatin states, transcription factor binding sites)

Enrichment = functional impact

Why do we care?

  • Hypothesis: SNPs in epigenomic regions may disrupt regulation
  • More significant enrichment => more SNPs in epigenomic regions => more regulation is disrupted (SNP burden)

 

  • Epigenomic elements enriched in SNPs (altered by them) may be new therapeutic targets

  • Systems biology approach - don't try to fix one faulty SNP, target the whole regulatory system affected by all SNPs

What does GenomeRunner analyze?

Experimental data vs. epigenomic data

  • Differentially expressed genes, promoters
  • Differentially methylated regions
  • SNPs, copy number variations, Insertions/Deletions, Structural Variants
  • ChIP-seq peaks
  • 3D interacting regions

Statistics of epigenomic enrichments

 

  • 6 out of 7 disease-associated SNPs overlap with epigenomic marks
  • How likely is this to be observed by chance? (Hypergeometric test/Binomial test/Permutation test)
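
As a rough illustration of the binomial option, the sketch below computes the chance of seeing 6 or more overlaps among 7 SNPs. It assumes each SNP overlaps an epigenomic mark independently with a background probability p; the value p = 0.3 and the function name `binomial_upper_tail` are hypothetical and not taken from the slides.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical background: 30% of SNPs overlap the marks by chance.
p_value = binomial_upper_tail(k=6, n=7, p=0.3)
print(f"P(at least 6 of 7 SNPs overlap) = {p_value:.5f}")
```

The smaller the assumed background probability, the more surprising 6 of 7 overlaps becomes, so the choice of background matters as much as the test itself.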

Statistics of epigenomic enrichments

Hypergeometric test

  • \(m\) is the total number of SNPs
  • \(j\) is the number of SNPs annotated with a property
  • \(n\) is the number of selected SNPs
  • \(k\) is the number of selected SNPs annotated with a property

              | Selected SNPs | Not selected SNPs | Total
Annotated     | k             | j-k               | j
Not annotated | n-k           | m-n-j+k           | m-j
Total         | n             | m-n               | m

What is the probability of having \(k\) or more annotated SNPs among the selected \(n\) SNPs?

\[P = \sum_{i=k}^n{ \frac{{m-j \choose n-i}{j \choose i}}{{m \choose n}} }\]
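
The sum above can be evaluated directly with exact integer binomial coefficients. This is a minimal sketch using only the Python standard library; the function name `hypergeom_upper_tail` and the example counts are hypothetical, while the arguments follow the symbols \(m\), \(j\), \(n\), \(k\) defined above.

```python
from math import comb


def hypergeom_upper_tail(m: int, j: int, n: int, k: int) -> float:
    """P(k or more annotated SNPs among n selected SNPs).

    m: total number of SNPs
    j: number of SNPs annotated with a property
    n: number of selected SNPs
    k: number of selected SNPs annotated with a property
    """
    upper = min(n, j)  # terms with i > j contribute zero
    favorable = sum(comb(m - j, n - i) * comb(j, i) for i in range(k, upper + 1))
    return favorable / comb(m, n)


# Hypothetical counts: 1000 SNPs in total, 300 annotated, 7 selected, 6 of them annotated.
print(hypergeom_upper_tail(m=1000, j=300, n=7, k=6))
```

Unlike the binomial test, this treats the selected SNPs as drawn without replacement from a finite background of \(m\) SNPs.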

Functional impact of SNPs

HDL cholesterol

  • Epigenomic enrichments are sorted from most to least significant
  • E118-H3K4me1_bPk-processed = "cell/tissue ID"-"factor ID"-"source"
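
The naming scheme above can be split into its three fields programmatically. This is a small sketch that assumes the three fields are always separated by exactly two hyphens; the names `AnnotationName` and `parse_annotation_name` are hypothetical helpers, not part of GenomeRunner.

```python
from typing import NamedTuple


class AnnotationName(NamedTuple):
    cell_id: str    # cell/tissue ID, e.g. "E118"
    factor_id: str  # factor ID, e.g. "H3K4me1_bPk"
    source: str     # source, e.g. "processed"


def parse_annotation_name(name: str) -> AnnotationName:
    """Split 'cell/tissue ID'-'factor ID'-'source' into named fields."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three hyphen-separated fields, got: {name!r}")
    return AnnotationName(*parts)


print(parse_annotation_name("E118-H3K4me1_bPk-processed"))
```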

Functional impact of SNPs

HDL cholesterol

  • SNPs associated with HDL cholesterol are enriched in activating histone modification marks in liver cells

Epigenomic similarity analysis

Epigenomic similarity

  • Hypothesis - Genomic variants associated with different diseases may be enriched in (= disrupt) similar regulatory elements.

Different diseases - similar mechanisms

Epigenomic similarity

Why do we care?

  • Epigenomic similarity among diseases - similar treatment strategies

  • Unknown genetic disorders can be matched to known diseases by epigenomic similarity

  • Patients can be matched by epigenomic similarity of their genomes

Epigenomic similarity

Epigenomic enrichment profiles

Epigenomic enrichment profile – SNP set-specific vector of \(-\log_{10}\)-transformed epigenomic enrichment p-values

Epigenomic enrichment profiles

Comparison

\(-\log_{10}\)-transformed epigenomic enrichment profiles can be compared using Spearman correlation or PCA
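
To make the comparison concrete, the sketch below builds two enrichment profiles from p-values and computes their Spearman correlation with the standard library only. The p-values, the floor used to avoid log10(0), and all function names are hypothetical illustrations rather than the GenomeRunner implementation.

```python
from math import log10, sqrt


def enrichment_profile(p_values, floor=1e-300):
    """-log10 transform of enrichment p-values (floored to avoid log10(0))."""
    return [-log10(max(p, floor)) for p in p_values]


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical p-values for the same four annotations in two SNP sets.
set_a = enrichment_profile([1e-8, 1e-3, 0.2, 0.5])
set_b = enrichment_profile([1e-6, 1e-4, 0.1, 0.9])
print(spearman(set_a, set_b))
```

Because Spearman correlation depends only on ranks, two SNP sets score as similar when they rank the annotations in the same order, even if the magnitudes of their p-values differ.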

Comparing epigenomic enrichment profiles

PCA separates autoimmune disease-associated SNP sets as the most epigenomically distinct from others

Comparing epigenomic enrichment profiles

Validating epigenomic similarity

  • Shared loci - Jaccard statistics
  • Semantic similarity - minMim, misn, obsExp, relOverlap, sharedRels
  • Disease ontology similarity - jiang, lin, rel, resnik, wang
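
For the shared-loci measure, the Jaccard statistic is the size of the intersection of two locus sets divided by the size of their union. A minimal sketch follows; the locus identifiers are hypothetical placeholders, and returning 0.0 for two empty sets is an assumption made here for convenience.

```python
def jaccard(loci_a, loci_b) -> float:
    """Jaccard statistic: |A ∩ B| / |A ∪ B| for two collections of loci."""
    a, b = set(loci_a), set(loci_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as sharing nothing
    return len(a & b) / len(union)


# Hypothetical locus identifiers for two diseases.
disease_1 = {"locus1", "locus2", "locus3"}
disease_2 = {"locus2", "locus3", "locus4"}
print(jaccard(disease_1, disease_2))
```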

Validating epigenomic similarity

  • Shared loci vs. literature: median Spearman correlation = 0.46
  • Shared loci vs. disease ontology: median Spearman correlation = 0.30
  • Epigenomic similarity vs. shared loci: median Spearman correlation = 0.61

Differential epigenomic analysis

Differential epigenomic analysis

  • Comparison of multiple sets of genomic variants
  • Hypothesis - some sets of genomic variants show consistently more/less significant enrichments

Differential epigenomic analysis

Why do we care?

  • Better understanding of the regulatory mechanisms associated with subgroups of diseases

  • Finer stratification of patients by the epigenomic differences associated with their genomes

  • More precise treatment strategies

Differential epigenomic analysis

  • Define two groups of SNP sets (use epigenomic similarity analysis)
  • We are testing whether the level of enrichment differs between groups of SNP sets
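
The slides do not name the test used for this comparison. As one possible illustration, the sketch below applies a two-sided permutation test to the difference in mean \(-\log_{10}\) p-values of a single epigenomic mark between two groups of SNP sets. The choice of test, the function name `permutation_test`, and the example values are all assumptions.

```python
import random


def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign SNP sets to the two groups
        if mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical -log10 p-values of one mark across SNP sets in two groups.
group_red = [1.2, 0.8, 1.5, 1.1]
group_green = [4.3, 5.1, 3.8, 4.9]
print(permutation_test(group_red, group_green))
```

Running such a test for every epigenomic mark would require a multiple-testing correction before interpreting the results.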

Differential epigenomic analysis

Quiescent chromatin states, T helper cell-specific - enriched in disease SNPs in red cluster

Enhancers, T helper cell-specific - enriched in disease SNPs in green cluster

Cell type-specific epigenomic enrichment analysis

Cell type-specific epigenomic enrichment analysis

  • Epigenomic elements are cell- and tissue type specific

  • Hypothesis - a SNP set is more likely to be enriched in epigenomic elements from cell types relevant to the phenotype

  • Cell type-specific epigenomic enrichment analysis identifies cell types where enrichments are the most significant

Cell type-specific epigenomic enrichment analysis

Why do we care?

  • More focused understanding of functional impact caused by SNPs

  • Treatment of epigenomic abnormalities in relevant cell types

Cell type-specific epigenomic enrichment analysis

Detection of cell/tissue types with many regulatory marks enriched in SNPs

  • Global distribution of enrichments - the null model
  • Cell type-specific distribution of enrichments - the alternative model
  • Test whether cell type-specific enrichments are significantly higher than the global level of enrichments
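
One way to sketch the comparison above is a one-sided resampling test: the mean enrichment of marks from one cell type is compared against the means of random, equally sized subsets drawn from all enrichments. The slides do not specify the actual test, so the procedure, the function name `cell_type_specificity`, and the example values are assumptions.

```python
import random


def cell_type_specificity(cell_values, all_values, n_draws=10000, seed=0):
    """One-sided p-value: is the cell type's mean enrichment above the global level?

    cell_values: -log10 enrichment p-values of marks from one cell type
    all_values:  -log10 enrichment p-values of marks from all cell types (a list)
    """
    rng = random.Random(seed)
    size = len(cell_values)
    observed = sum(cell_values) / size
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(all_values, size)  # null model: random marks, any cell type
        if sum(sample) / size >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_draws + 1)


# Hypothetical enrichments: 100 marks overall, 5 of them from one cell type.
global_enrichments = [float(v) for v in range(100)]
cell_type_enrichments = [95.0, 96.0, 97.0, 98.0, 99.0]
print(cell_type_specificity(cell_type_enrichments, global_enrichments))
```

Cell types with the smallest p-values would be the candidates where the enrichments of a SNP set are concentrated.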

Cell type specificity of the functional impact of disease SNPs is relevant to disease pathology

  • Alzheimer's disease-associated SNPs are enriched in brain-specific epigenomic marks

Personalized epigenomics

Experimental setup

  • 431 patients diagnosed with Systemic Lupus Erythematosus (SLE)

  • Exome sequencing - SNP sets

  • Focus on rare variants - patient-specific SNP sets

Why rare variants?

  • Common SNPs are largely the same in healthy controls and SLE patients

  • ~50% of rare SNPs are specific to either healthy controls or SLE patients

Hypothesis

  • Patients may be diagnosed with the same disease, but have different underlying mechanisms

  • Patient-specific SNP sets may have similar epigenomic enrichment profiles

  • Patients can be classified by similarity of their epigenomic enrichment profiles

  • Subgroups of patients having differential epigenomic enrichments may have different clinical outcomes

Patient-specific SLE-associated rare SNPs

  • Five subgroups of patients with distinct epigenomic enrichments

Differential clinical parameters

  • Epigenomically distinct subgroups of patients also significantly differ by clinical attributes

Summary

  • GenomeRunner defines potential functional impact of SNP sets via epigenomic enrichment analysis

  • Epigenomic similarity analysis identifies regulatory similarity and differences among SNP sets

  • Cell type-specific enrichment analysis prioritizes cell/tissue type specificity of the epigenomic enrichments

  • Epigenomic enrichment analyses can be applied to any genomic signature, from disease-associated SNP sets to patient-specific genotypes

What's next?

Data integration

3D interaction data

  • Loops
  • Topologically associating domains (TADs)
  • Stable/unstable 3D genomic regions
  • Strongly interacting regions

Enrichment methods development

Analysis: Population epigenomics

  • Busby, George (2016), “Genotype data for a set of 163 worldwide populations”, https://data.mendeley.com/datasets/ckz9mtgrjj/1
  • 2,643 individuals from 163 worldwide human populations.
  • Epigenomic and 3D similarities and differences in population genome architecture

Analysis: Disease epigenomics

  • GRASP: Genome-Wide Repository of Associations Between SNPs and Phenotypes
  • 2,082 GWAS studies.
  • Epigenomic and 3D similarities and differences in genomic architecture of complex diseases

Analysis: Personalized epigenomics

  • TCGA - 11,000 patients, 33 tumor types, 7 data types (genotypes, gene expression, methylation etc.).
  • Epigenomic classification of cancer subtypes
  • Survival differences associated with patient-specific epigenomic signatures

Acknowledgement

  • Jonathan Wren, Oklahoma Medical Research Foundation
  • Cory Giles, Oklahoma Medical Research Foundation
  • Lukas Cara, University of St. Thomas, Houston
  • Jianlin Cheng, University of Missouri, Columbia
  • Tuan Trieu, University of Missouri, Columbia
  • Bridget Thomson-McInnes, Virginia Commonwealth University
  • John Stansfield, Virginia Commonwealth University
  • Kellen Cresswell, Virginia Commonwealth University

Thank you